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Abstract. In this paper, we examine a number of additive and multiplicative multi- 
level iterative methods and preconditioners in the setting of two-dimensional local mesh 
refinement. While standard multilevel methods are effective for uniform refinement- 
based discretizations of elliptic equations, they tend to be less effective for algebraic 
systems which arise from discretizations on locally refined meshes, losing their opti- 
mal behavior in both storage and computational complexity. Our primary focus here is 
on BPX-style additive and multiplicative multilevel preconditioners, and on various sta- 
bilizations of the additive and multiplicative hierarchical basis method (HB), and their 
use in the local mesh refinement setting. In the first two papers of this trilogy, it was 
shown that both BPX and wavelet stabilizations of HB have uniformly bounded condi- 
tions numbers on several classes of locally refined 2D and 3D meshes based on fairly 
standard (and easily implementable) red and red-green mesh refinement algorithms. In 
this third article of the trilogy, we describe in detail the implementation of these types 
of algorithms, including detailed discussions of the datastructures and traversal algo- 
rithms we employ for obtaining optimal storage and computational complexity in our 
implementations. We show how each of the algorithms can be implemented using stan- 
dard datatypes available in languages such as C and FORTRAN, so that the resulting 
algorithms have optimal (linear) storage requirements, and so that the resulting multi- 
level method or preconditioner can be applied with optimal (linear) computational costs. 
Our implementations are performed in both C and MATLAB using the Finite Element 
ToolKit (FEtk), an open source finite element software package. We finish the paper 
with a sequence of numerical experiments illustrating the effectiveness of a number of 
BPX and stabilized HB variants for several examples requiring local refinement. 



Contents 



1 . Introduction 2 

2. Overview of the multilevel methods 4 

3. Implementation 6 

4. Numerical Experiments 1 1 

5. Conclusion 16 
Acknowledgments 16 
References 16 



Date: February 13, 2003. 

Key words and phrases, finite element methods, local mesh refinement, multilevel preconditioning, 
BPX, red and red-green refinement, 2D and 3D, datastructures. 

The first author was supported in part by the Burroughs Wellcome Fund through the LJIS predoctoral 
training program at UC San Diego, in part by NSF (ACI-9721349, DMS-9872890), and in part by DOE 
(W-7405-ENG-48/B341492). Other support was provided by Intel, Microsoft, Alias | Wavefront, Pixar, and 
the Packard Foundation. 

The second author was supported in part by the Howard Hughes Medical Institute, and in part by NSF 
and NIH grants to J. A. McCammon. 

The third author was supported in part by NSF CAREER Award 9875856 and in part by a UCSD 
Hellman Fellowship. 

1 



2 



B. AKSOYLU, S. BOND, AND M. HOLST 



1. Introduction 



While there are a number of effective (often optimal) multilevel methods for uniform 
refinement-based discretizations of elliptic equations, only a handful of these methods 
are effective for algebraic systems which arise from discretizations on locally refined 
meshes, and these remaining methods are typically suboptimal in both storage and com- 
putational complexity. In this paper, we examine a number of additive and multiplicative 
multilevel iterative methods and preconditioners, specifically for two-dimensional local 
mesh refinement scenarios. Our primary focus is on Bramble, Pasciak, and Xu (BPX)- 
style additive and multiplicative multilevel preconditioners, and on stabilizations of the 
additive and multiplicative hierarchical basis method (HB). In [1, 2, 3], it was shown that 
both BPX and wavelet stabilizations of HB have uniformly bounded conditions numbers 
on several classes of locally refined 2D and 3D meshes based on fairly standard (and 
easily implementable) red and red-green mesh refinement algorithms. In this article, we 
describe in detail the implementation of these types of algorithms, including detailed dis- 
cussions of the datastructures and traversal algorithms we employ for obtaining optimal 
storage and computational complexity in our implementations. We show how each of 
the algorithms can be implemented using standard datatypes available in languages such 
as C and FORTRAN, so that the resulting algorithms have optimal (linear) storage re- 
quirements, and so that the resulting multilevel method or preconditioner can be applied 
with optimal (linear) computational costs. Our implementations are performed in both 
C and MATLAB using the Finite Element ToolKit (FEtk), an open source finite element 
software package. We also present a sequence of numerical experiments illustrating the 
effectiveness of a number of BPX and stabilized HB variants for examples requiring local 
refinement. 

The problem class of interest for our purposes here is linear second order partial dif- 
ferential equations (PDE) of the form: 



Here, / G L 2 (0), p, q G L^Q) and p : Q -> 3ft d ), q : Q -> $ft, where p is 

a symmetric positive definite matrix, and q is nonnegative. Let % be a shape regular 
and quasiuniform initial partition of fi into a finite number of <i-simplices, and generate 
7i, %, ■ ■ ■ by refining the initial partition using either red-green or red local refinement 
strategies in d = 2 or d = 3 spatial dimensions. Let Sj be the simplicial linear C° finite 
element (FE) space corresponding to Tj equipped with zero boundary values. The set of 

nodal basis functions for Sj is denoted by {<p[ where Nj = dim Sj is equal to the 
number of interior nodes in Tj. Successively refined FE spaces will form the following 
nested sequence: 



Although the mesh is nonconforming in the case of red refinement, <Sj is used within the 
framework of conforming FE methods for discretizing (1.1). 

Let the bilinear form and the linear functional representing the weak formulation 
of ( 1 . 1 ) be denoted as 



and let us consider the following Galerkin formulation: Find u G Sj, such that 



V -(pVu) + qu = f, ueH^(Q). 



(LI) 



<S c Si c . . . c Sj c . . . c flj(fi). 



(1.2) 




/ v dx, u, v G H (ft), 



a{u,v) = b{v), Wv G Sj. 



(1.3) 
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Employing the expansion u = Y^i=i u i 4>i m tne nodal basis for Sj, problem (1.3) 
reduces to an algebraic equation of the form: 

A {j) u {j) = e ^ (1.4) 

for the combination coefficients u^' G $l Nj . The nodal discretization matrix and vector 
arise then as: 

A® = a(#,#), b?> = &(#), 1 < r,s < Nj. 

Solving the discretized form of (1.3), namely (1.4), by iterative methods, has been 
the subject of intensive research because of the enormous practical impact on a number 
of application areas in computational science. For quality approximation in physical 
simulation, one is required to use meshes containing very large numbers of simplices 
leading to approximation spaces Sj with very large dimension Nj. Only iterative methods 
which scale well with Nj can be used effectively, which usually leads to the use of 
multilevel-type iterative methods and preconditioners. Even with the use of such optimal 
methods for (1.4), which means methods which scale linearly with Nj in both memory 
and computational complexity, the approximation quality requirements on Sj often force 
Nj to be so large that only parallel computing techniques can be used to solve (1.4). 

To overcome this difficulty one employs adaptive methods, which involves the use of 
a posteriori error estimation to drive local mesh refinement algorithms. This approach 
leads to approximation spaces Sj which are adapted to the particular target function u of 
interest, and as a result can achieve a desired approximation quality with much smaller 
approximation space dimension Nj than non-adaptive methods. One still must solve the 
algebraic system (1.4), but unfortunately most of the available multilevel methods and 
preconditioners are no longer optimal, in either memory or computational complexity. 
This is due to the fact that in the local refinement setting, the approximation spaces Sj do 
not increase in dimension geometrically as they do in the uniform refinement setting. As 
a result, a single multilevel V-cycle no longer has linear complexity, and the same diffi- 
culty is encountered by other multilevel methods. Moreover, storage of the discretization 
matrices and vectors for each approximation space, required for assembling V-cycle and 
similar iterations, no longer has linear memory complexity. 

A partial solution to the problem with multilevel methods in the local refinement set- 
ting was provided by the HB method [4, 5, 21]. This method was based on a direct or 
hierarchical decomposition of the approximation spaces Sj rather than the overlapping 
decomposition employed by the multigrid and BPX method, and therefore by construc- 
tion had linear memory complexity as well as linear computational complexity for a 
single V-cycle-like iteration. Unfortunately, the HB condition number is not uniformly 
bounded, leading to worse than linear overall computational complexity. While the con- 
dition number growth is slow (logarithmic) in two dimensions, it is quite rapid (geomet- 
ric) in three dimensions, making it ineffective in the 3D local refinement setting. Recent 
alternatives to the HB method, including both BPX-like methods [7, 8] and wavelet-like 
stabilizations of the HB methods [19], provide a final solution to the condition number 
growth problem. It was shown in [9] that the BPX preconditioner has uniformly bounded 
condition number for certain classes of locally refined meshes in two dimensions, and 
more recently in [2] it was shown that the condition number remains uniformly bounded 
for certain classes of locally refined meshes in three spatial dimensions. In [3], it was 
also shown that wavelet-stabilizations of the HB method gave rise to uniformly bounded 
conditions numbers for certain classes of local mesh refinement in both the two- and 
three-dimensional settings. 
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In view of [2] and [3], our interest in this paper is to examine the practical imple- 
mentation aspects of both BPX and stabilized HB iterative methods and preconditioners. 
In particular, the remainder of the paper is structured as follows. In §2, we review the 
algorithms presented in [2] and [3], giving a unified algorithm framework on which im- 
plementations can be based. The core of the paper is in some sense §3, which describes 
in detail the datastructures and key algorithms employed in the implementation of the 
algorithms. The focus is on practical realization of optimal (linear) complexity of the 
implementations, in both memory and operation complexity. The FEtk software package 
which was leveraged for our implementations is described briefly in §3.3. A sequence of 
numerical experiments with the implementations is presented in §4, illustrating the con- 
dition number growth properties of BPX and stabilized HB methods. Finally, we draw 
some conclusions in §5. 

2. Overview of the multilevel methods 

In the first article [2] of the trilogy, it was shown that the BPX preconditioner was 
optimal on the meshes under the local 2D and 2D red-green, as well as local 2D and 3D 
red, refinement procedures. The classical BPX preconditioner [8, 20] can be written as 
an action of the operator X as follows: 

J Nj 

XU = J2 2 Kd - 2) 5>' « € Sj. (2.1) 

3=0 i=l 

Let the prolongation operator from level j — 1 to j be denoted by Pj_x, and also denote 
the prolongation operator from level j to J as: 

P. = pf = pj , . . . PJ+ 1 e tt v ' xV . 

where Pj is defined to be the identity matrix I 6 ffi NjxNj . Then the matrix representa- 
tion of (2.1) becomes [20]: 

j 

x=j2 2j{d ~ 2)p j p h 

3=0 

One can also introduce a version with a smoother Sf 

j 

x = j2 2j(d ~ 2)p j s 3 p j- 

3=0 

The preconditioner (2.1) can be modified in the hierarchical sense; 

J Nj 

X BB u = J2 2j{d ~ 2) E K^'V?'^^ (2-2) 

j=0 i=Nj-i+l 

The new preconditioner corresponds to the additive HB preconditioner in [21]. The 
matrix representation of (2.2) is formed from matrices Hj which are simply the tails of 
the Pj corresponding to newly introduced degrees of freedom (DOF) in the fine space. 
In other words, Hj G SR N J x { N i- N i-i) is given by only keeping the fine columns (the last 
Nj — iVj_i columns of Pj). Hence, the matrix representation of (2.2) becomes: 

j=0 
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Only in the presence of a geometric increase in the number of DOF, the same assump- 
tion for optimality of a single classical multigrid or BPX iteration, does the cost per 
iteration remain optimal. In the case of local refinement, the BPX preconditioner (2.1) 
(usually known as additive multigrid) can easily be suboptimal because of the subopti- 
mal cost per iteration (see Figure 7). On the other hand, the HB preconditioner (2.2) 
suffers from a suboptimal iteration count. The above deficiencies of the preconditioners 
(2.1) and (2.2) can be overcome by restricting the sum over i in (2.1) only to those nodal 
basis functions with supports that intersect the refinement region [6, 7, 9, 14]. We call 
this set 1-ring of fine DOF, namely, the set which contains fine DOF and their immediate 
neighboring coarse DOF. The following is referred as the BPX preconditioner for local 
refinement. 

Xu = J2^ d -v MfW^ueSj, (2.3) 

j=0 ieONERING 

where ONERTNG= {1 - ring(ii) : ii = JV,-_i + 1, . . . , N,}. 

The BPX decomposition gives rise to basis functions which are not locally supported, 
but they decay rapidly outside a local support region. This allows for locally supported 
approximations as illustrated in Figures 1, 2, and 3. 




Figure 1 . Hierarchical basis function without modification. 




Figure 2. Wavelet modified hierarchical basis function with one itera- 
tion of symmetric Gauss-Seidel approximation, upper and lower view. 

The wavelet modified hierarchical basis (WMHB) methods [17, 18, 19] can be viewed 
as an approximation of the wavelet basis stemming from the BPX decomposition [13]. 
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Figure 3. Wavelet modified hierarchical basis function with one itera- 
tion of Jacobi approximation, upper and lower view. 

A similar wavelet-like multilevel decomposition approach was taken in [16], where the 
orthogonal decomposition is formed by a discrete L 2 -equivalent inner product. This 
approach utilizes the same BPX two-level decomposition [15, 16]. 

For adaptive regimes, other primary method of interest is the WMHB method. The 
WMHB methods can be described as additive or multiplicative Schwarz methods. In one 
of the previous papers [3] of this trilogy, it was shown that the additive version of the 
WMHB method was optimal under certain types of red-green mesh refinement. Follow- 
ing the notational framework in [3, 19], this method is defined recursively as follows: 

Definition 2.1. The additive WMHB method D^' is defined for j = 1, . . . , J as 







B 



if) 

22 



with £>«» = AW. 



With smooth PDE coefficients, optimal results were also established for the multi- 
plicative version of the WMHB method in [3]. Our numerical experiments demonstrate 
such optimal results. This method can be written recursively as: 

Definition 2.2. The multiplicative WMHB method B^ is defined as 



B {i) 



A U) 

B (j) 
D 22 



B 



(j) 
22 



- 1 

A 21 







- 1 ) + A$B® 



A 



21 



l 12 -^22 



/i 21 



A U) 

^22 





with = A<°>. 

Am i, Ayi , A22 represent subblocks of A® and they correspond to coarse-fine, fine- 
coarse, and fine-fine interactions of DOF at level j, respectively. B^ denotes an ap- 
proximation of A22 , e.g. Gauss-Seidel or Jacobi approximation. For a more complete 
description of these and related algorithms, see [2, 3]. 

3. Implementation 

The overall utility of any finite element code depends strongly on efficient implemen- 
tation of its core algorithms and data structures. Theoretical results involving complexity 
are of little practical importance if the methods cannot be implemented. For algorithms 
involving data structures, this usually means striking a balance between storage costs and 
computational complexity. Finding a minimal representation for a data set is only useful 
if the information can be accessed efficiently. 
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3.1. Sparse Matrix Structures. Our implementation relies on a total of four distinct 
sparse matrix data structures: compressed column (COL), compressed row (ROW), diagonal- 
row-column (DRC), and orthogonal- linked list (XLN). Each of these storage schemes 
attempts to record the location and value of the nonzeros using a minimal amount of 
information. The schemes differ in the exact representation which effects the speed and 
manner with which the data can be retrieved. To illustrate how each of these data struc- 
tures works in practice, we consider storing the following sparse matrix: 



1 2 

3 4 5 6 
7 8 

9 10 
11 12 13 



(3.1) 



• COL: The compressed column format is the most commonly used sparse matrix type 
in the literature. It is the format chosen for the Harwell-Boeing matrix collection [11], 
and is used in production codes such as SuperLU [10]. In this data structure, the nonzeros 
are arranged by column in a single double-precision array: 

Acol = [1, 3, 2, 4, 7, 11, 5, 8, 9, 12, 6, 10, 13] . 

The indices of A (often referred to as pointers) corresponding to the first entry in each 
column is then stored in an integer array: 

IAcol = [1, 3, 7, 9, 11, 14] . 

The length of the array IA is always one greater than the number of columns, with the 
last entry is equal to the number of nonzeros plus one. The difference in successive 
entries in the IA array reflects the number of nonzeros in each column. If a column has 
no nonzeros, the index from the next column is repeated. To determine the location of 
each nonzero within its column, the row index of each entry is stored in an integer array: 

JAcol = [1, 2, 1, 2, 3, 5, 2, 3, 4, 5, 2, 4, 5] . 

There is no restriction that the entries are ordered within each column, only that the 
columns are ordered. The memory required to store this datastructure is: (nZ + nC + 
1) * size (int) + nZ * size (double), where nZ and nC are the number of nonzeros and 
columns respectively. 

• ROW: The compressed row data structure is just the transpose of the compressed 
column data structure, where the nonzero entries, row pointers, and column indices are 
stored in A, IA, and JA respectively: 

A ROW = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] , 

IArqw = [1, 3, 7, 9, 11, 14] , JArqw = [1, 2, 1, 2, 3, 5, 2, 3, 4, 5, 2, 4, 5] . 
One should note that since in our example the matrix is structurally symmetric, the IA 
and J A arrays are identical in both the ROW and COL cases. The memory required to 
store this datastructure is: (nZ + nR + 1) * size (int) + nZ * size (double), where nR is 
the number of rows. 

• DRC: The diagonal-row-column format is a structurally symmetric data structure, 
which is only valid for square matrices. In this format, the diagonal is stored in its own 
full vector, while the strictly upper and lower triangular portions are stored in ROW and 
COL formats respectively. Leveraging the symmetry in the nonzero structure, the same 
I A and J A arrays can be used for the upper and lower triangular parts: 

ADdrc = [1, 4, 8, 9, 13] , 
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AU DRC = [2, 5, 6, 10] , AL DRC = [3, 7, 11, 12] 
IA DRC = [1, 2, 4, 4, 5, 5] , JA DRC = [2, 3, 5, 5] . 

The memory required to store this datastructure is less than ROW or COL if the diagonal 
is full, and the matrix is structurally symmetric. 



1 
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13 
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5 
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Figure 4. An illustration of the XLN datastructure. 



• XLN: The orthogonal-linked list format is the only dynamically "fillable" datastruc- 
ture used by our methods. By using variable length linked lists, rather than a fixed length 
array, it is suitable for situations where the total number of nonzeros is not known a 
priori. The XLN datastructure is illustrated graphically in Figure 4. For each nonzero, 
there is a link containing the value, row index, column index, and pointers to the next in 
the row and column. To keep track of the first link in each row and column, there are 
two additional pointer arrays, IP and JP. As long as there are "order-one" nonzeros per 
row, accessing any entry can be accomplished in "order-one" time. The structure can 
be traversed both rowwise, and columnwise depending on the situation. If the matrix is 
symmetric, only the lower triangular portion is stored. The total storage overhead for 
this structure is: nZ * (size (double) + 2 * size (hit)) + (nC + nR + 2 nZ) * size (ptr). 
Although this is considerably more than the other three datastructures, one should note 
that the asymptotic complexity is still linear in the number of nonzeros. 

3.2. Sparse Matrix Products. The key preprocessing step in the hierarchical basis 
methods, is converting the "nodal" matrices and vectors into the hierarchical basis. This 
operation involves sparse matrix-vector and matrix-matrix products for each level of re- 
finement. To ensure that this entire operation has linear cost, with respect to the number 
of unknowns, the per-level change of basis operations must have a cost of O (rij), where 
Nj-i is the number of "new" nodes on level j. For the traditional multigrid 



n ; 



N 3 
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algorithm this is not possible, since enforcing the variational conditions operates on all 
the nodes on each level, not just the newly introduced nodes. 

The linear operator which converts from the nodal to the hierarchical basis can be 
written in terms of a change of basis matrix: 



G = 



I K 12 
K 2 i I + K 22 



where G € ^ xN \ K 12 e ft^- lXn *, K 21 e ^ xN ^-\ and K 22 e 9^ xn *. In this 
representation, we have assumed that the nodes are ordered with the nodes Nj-i inherited 
from the previous level listed first, and the rij new DOF listed second. For both wavelet 
modified (WMHB) and unmodified hierarchical basis (HB), the K 21 block represents the 
last rij rows of the prolongation matrix, P]_\- In the HB case, the K\ 2 and K 22 blocks 
are zero resulting in a very simple form: 



Chb — 



/ 

K 21 I 



(3.2) 



G 



wmhb 



(3.3) 



For WMHB, the K 12 and K 22 blocks are computed using the mass matrix, which results 
in the following formula: 

I -inv [M n h ] Ml 
K 21 I - K 21 mv [M n h ] Ml 

where the inv [•] is some approximation to the inverse which preserves the complexity. 
For example, it could be as simple as the inverse of the diagonal, or a low-order matrix 
polynomial approximation. The M hb blocks are taken from the mass matrix in the HB 
basis: 

M hb = C7^ b M nodal L7 hb . (3.4) 

For the remainder of this section, we restrict our attention to the WMHB case. The HB 
case follows trivially with the two additional subblocks of K set to zero. 

To reformulate the nodal matrix representation of the bilinear form in terms of the 
hierarchical basis, we must perform a triple matrix product of the form: 



a wmhb 



/~tT a nodal/^y 



r (i) 

= ( J + K u)) ^a) dal ( J + K u)) ■ 

In order to keep linear complexity, we can only copy ^4 nodal a fixed number of times, i.e. 
it cannot be copied on every level. Fixed size data-structures are unsuitable for storing 
the product, since predicting the nonzero structure of Aj™ hh is just as difficult as actually 
computing it. It is for these reasons that we have chosen the following strategy: First, 
copy ^4 nodal on the finest level, storing the result in an XLN which will eventually become 
A wmhb . Second, form the product pairwise, contributing the result to the XLN. Third, the 
last rij columns and rows of A wmhb are stripped off, stored in fixed size blocks, and the 
operation is repeated on the next level, using the A n block as the new A nodlil : 

Algorithm 3.1. (Wavelet Modified Hierarchical Change of Basis) 

• Copy A} odal -»• A wmhh in XLN format. 

• While j > 

(1) Multiply A wmhh = A wmhh G as 





A 12 ' 


+ = 


" An 


A 12 ' 







K 12 ' 


A 2X 


A 22 _ 


_ A 2 i 


A 22 _ 




. K21 


K22 _ 
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(2) Multiply A wmhh = G T A wmhh as 

' Au A12 
A21 A22 

(3) Remove A$, A$, A 2 J 2 > blocks of A wmtlb storing in ROW, COL, and DRC formats 
respectively. 

(4) After the removal, all that remains of A wmhb is its A$ block. 

(5) Let j = j - 1, descending to level j — 1. 

• End While. 

• Store the last A wmhh as A coarsc 

We should note that in order to preserve the complexity of the overall algorithm, all of 
the matrix-matrix algorithms must be carefully implemented. For example, the change 
of basis involves computing the products of An with K i2 and Kj 2 . To preserve storage 
complexity, K X2 must be kept in compressed column format, COL. For the actual prod- 
uct, the loop over the columns of K 12 must be ordered first, then a loop over the nonzeros 
in each column, then a loop over the corresponding row or column in A n . It is exactly 
for this reason, that one must be able to traverse An both by row and by column, which 
is why we have chosen an orthogonal-linked matrix structure for A during the change of 
basis (and hence An). 

To derive optimal complexity algorithms for the other products, it is enough to ensure 
that the outer loop is always over a dimension of size rij. Due to the limited ways in 
which a sparse matrix can be traversed, the ordering of the remaining loops will usually 
be completely determined. Further gains can be obtained in the symmetric case, since 
only the upper or lower portion of the matrix needs to be explicitly computed and stored. 



+ 



Kl 
-"■12 ^22 



^11 A 12 
A21 A22 



3.3. The Finite Element ToolKit (FEtk). A number of variations of the methods de- 
scribed above have been implemented using the Finite Element ToolKit (FEtk) [12]. 
FEtk is an open source finite element modeling package which has been developed by 
the Hoist research group over several years at Caltech and UC San Diego, with gener- 
ous contributions from a number of colleagues. FEtk consists of a low-level portability 
library called MALOC (Minimal Abstraction Layer for Object-oriented C), on top of 
which is build a general finite element modeling kernel called MC (Manifold Code). 
Most of the images appearing later in this paper were produced using another compo- 
nent of FEtk call SG (Socket Graphics), which is also built on top of MALOC. FEtk also 
includes a fully functional MATLAB version of MC called MCLite, which shares with 
MC its datastructures, a posteriori error estimation and mesh refinement algorithms, and 
iterative solution methods. All of the preconditioners employed in this paper have been 
implemented by the authors as ANSI-C class library extensions to MC, and as MATLAB 
toolkit-like extensions to MCLite. The two implementations are mathematically equiva- 
lent, although the MCLite implementation is restricted to two spatial dimensions. (The 
MC -based implementation is both two- and three-dimensional.) The extensions to MC 
are distributed as the MCX library, and as MATLAB extensions to MCLite are distributed 
as MCLiteX. 

MALOC, SG, MC, and MCLite are freely redistributable under the GNU General 
Public License (GPL). More information about FEtk can be found at: 

http : / / www . f etk . org 
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4. Numerical Experiments 



Test problem is as follows: 

—V • (p Viz) + q u 
n ■ (p Vu) 

u 

where tt = [0, 1] x [0, 1] and 



P 



1 





g, on Tat, 
0, on T D , 



, and q = 1. 



The source term / is constructed so that the true solution is u = sin nx sin 7r?/. We present 
two experiment sets in which adaptivity is driven by a geometric criterion. Namely, 
the simplices which intersect with the quarter circle centered at the origin with radius 
0.25 and 0.05, in experiment sets I and II respectively, are repeatedly marked for further 
refinement. 

• Boundary conditions for the domain in experiment set I: 

T N = {(x,y) : x = 0,0 <y < 1}U {(x,y) : x = 1,0 <y < 1} 
T D = {(x,y):0<x<l,y = 0}U{(x,y):0<x<l,y = l}. 

• Boundary conditions for the domain in experiment set II: 

T N = {(x,y):0<x<l,y = 0}U{(x,y):0<x<l,y = l} 

U{(x, y) : x = 0, < y < 1} U {(x, y) : x = 1, < y < 1}. 

Stopping criterion: 1 1 error \\a < 10~ 7 . 

In experiment set I, red-green refinement subdivides simplices intersecting an arc of 
radius 0.25 which gives rise to a rapid increase in the number of DOR Although we have 
an adaptive refinement strategy, this indeed creates a geometric increase in the number 
of DOF, see Figure 5. Experiment set II is designed so that a small number of DOF 
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is introduced at each level. In order to do this, green refinement subdivides simplices 
intersecting a smaller arc with radius 0.05. 

Table 1 . MCLite iteration counts for various methods, red-green refine- 
ment driven by geometric refinement, experiment set I. 



Levels 


1 


2 


3 


4 


5 


6 


7 


8 


MG 


1 


4 


7 


7 


7 


6 


6 


6 


M.BPX 


1 


4 


7 


7 


7 


7 


6 


6 


HBMG 


1 


10 


19 


28 


32 


37 


45 


56 


WMHBMG 


1 


6 


12 


13 


16 


17 


17 


17 


PCG-MG 


1 


3 


4 


5 


5 


5 


5 


5 


PCG-M.BPX 




3 


5 


5 


5 


5 


5 


5 


PCG-HBMG 




3 


7 


10 


12 


14 


15 


16 


PCG- WMHBMG 




3 


7 


7 


9 


9 


9 


9 


PCG-A.MG 




8 


13 


17 


20 


21 


23 


24 


PCG-BPX 




6 


12 


14 


17 


17 


18 


18 


PCG-HB 




5 


14 


21 


26 


32 


38 


41 


PCG-WMHB 




5 


12 


15 


19 


20 


21 


21 


Nodes 


16 


19 


31 


55 


117 


219 


429 


835 


DOF 


8 


10 


21 


43 


102 


202 


410 


814 



In all the experiments, we utilize a direct coarsest level solve and smoother is a sym- 
metric Gauss-Seidel iteration. The set of DOF on which smoother acts is the fundamen- 
tal difference between the methods. Classical multigrid methods smooth on all DOF, 
whereas HB-like methods smooth only on fine DOF. WMHB style methods smooth as 
HB methods do. BPX style smoothing is a combination of multigrid and HB style. 

There are four multiplicative methods under consideration: MG, M.BPX, HBMG, and 
WMHBMG. The following is a guide to the tables and figures below. MG will refer to 
a classical multigrid, in particular corresponds to the standard V-cycle implementation. 
HBMG corresponds exactly to the MG algorithm, but where pre- and post- smoothing 
are restricted to fine DOF. M.BPX refers to multiplicative version of BPX. Smoother is 
restricted to fine DOF and their immediate coarse neighbors which are often called as 
the 1-ring neighbors. 1-ring neighbors of the fine nodes can be directly determined by 
the sparsity pattern of the fine-fine subblock A 2 2 of the stiffness matrix. The set of DOF 
over which BPX method smooths is simply the union of the column locations of nonzero 
entries corresponding to fine DOF. Using this observation, HBMG smoother can easily 
be modified to be a BPX smoother. WMHBMG is similar to HBMG, in that both are 
multiplicative methods in the sense of Definition 2.2, but the difference is in the basis 
used. In particular, the change of basis matrices are different as a result of the wavelet 
stabilization, where L 2 -projection to coarser finite element spaces is approximated by 
two Jacobi iterations. 

PCG stands for the preconditioned conjugate gradient method. PCG-A.MG, PCG- 
BPX, PCG-HB, and PCG-WMHB involve the use of additive MG, PBX, HB, and WMHB 
as preconditioners for CG, respectively. In the sense of Definition 2.1, HB and WMHB 
are additive versions of HBMG and WMHBMG respectively. Each preconditioner is 
implemented in a manor similar to that described in [17, 19]. 

Finally, note that Nodes denotes the total number of nodes in the simplicial mesh, 
including Dirichlet and Neumann nodes. The iterative methods view DOF as the union 
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Table 2. MCLite iteration counts for various methods, green refinement 
driven by geometric refinement, experiment set EL 
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of the unknowns corresponding to interior and Neumann/Robin boundary DOF, and these 
are denoted as such. 

The refinement procedure utilized in the experiments are fundamentally the same as 
the 2D red-green described in [2, 3]. We, however, remove the restrictive conditions 
that the simplices for level j + 1 have to be created from the simplices at level j and 
the bisected (green refined) simplices cannot be further refined. Even in this case the 
claimed results seem to hold. Experiments are done in MCLite module of the FEtk 
package. Several key routines from the MCLite software, used to produce most of the 
numerical results in this paper, are given in the appendix. 

Iteration counts are reported in Tables 1 and 2. The optimality of M.BPX, BPX, 
WMHBMG and WMHB is evidenced in each of the experiments with the constant num- 
ber of iterations, independent of the number of DOF. HB and HBMG methods suffer 
from a logarithmic increase in the number of iterations. Among all the methods tested, 
the M.BPX is the closest to MG in terms of low iteration counts. 
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Figure 6. Flop counts for single iteration of multiplicative (left) and 
additive (right) methods, experiment set I. 
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Figure 7. Flop counts for single iteration of multiplicative (left) and 
additive (right) methods, experiment set II. 



However, it should be clearly noted that in the experiments we present below, the cost 
per iteration of the various methods can differ substantially. We report flop counts of a 
single iteration of the above methods, see Figures 6 and 7. In experiment set I, the cost 
per iteration is linear for all the methods. The WMHB and WMHBMG methods are the 
most expensive ones. We would like to emphasize that the refinement in experiment set 
I cannot be a good example for adaptive refinement given the geometric increase in the 
number of DOF. MG exploits this geometric increase and enjoys a linear computational 
complexity. Experiment set II is more realistic in the sense that the refinement is highly 
adaptive and introduces a small number of DOF at each level. One can now observe 
a suboptimal (logarithmic) computational complexity for MG-lrke methods in such re- 
alistic scenarios. In accordance with the theoretical justification, under highly adaptive 
refinement MG methods will asymptotically be suboptimal. Moreover, storage complex- 
ity prohibitively prevents MG-like methods from being a viable tool for large and highly 
adaptive settings. 

Coarser representations of the finest level system (1.4) are algebraically formed by 
enforcing variational conditions. Some methods require further stabilizations in the from 
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Figure 8. Flop counts for variational conditions for experiment set I 
(left) and experiment set II (right). 



o 


mg 


-0- 


multip. bpx 


•-E3- 


hbmg 




wmhbmg 





-0- bpx 
hb 

-« wmhb 



Figure 9. Total flop counts of preconditioned multiplicative (left) and 
additive (right) methods, experiment set II. 



of matrix-matrix products. These form the so-called preprocessing step in multilevel 
methods. The computational cost of variational conditions is the same regardless of 
having a multiplicative or an additive version of the same method. This computational 
cost is orders of magnitude cheaper than the cost of a single iteration. However, this 
is the step where the storage complexity can dominate the overall complexity. Due to 
memory bandwidth problems on conventional machines, one should be very careful with 
the choice of datastructures. Since only the An = A coarse subblock of A is formed 
for the next coarser level, the cost of variational conditional for MG, M.BPX, A.MG, 
and BPX is the cheapest among all the methods. On the other hand, HBMG and HB 
require stabilizations of A 12 and A 21 using to the hierarchical basis. The WMHBMG and 
WMHB methods are more demanding by requiring stabilizations of A i2 , A 2 i, and A 22 
using the wavelet modified hierarchical basis. Wavelet structure creates denser change of 
basis matrix than that of the hierarchical basis. Therefore, preprocessing in the WMHB 
and WMHBMG methods is the most expensive among all the methods. 
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5. Conclusion 

In this paper, we examined a number of additive and multiplicative multilevel iterative 
methods and preconditioners in the setting of two-dimensional local mesh refinement. 
While standard multilevel methods are effective for uniform refinement-based discretiza- 
tions of elliptic equations, they tend to be less effective for algebraic systems which arise 
from discretizations on locally refined meshes, losing both their optimal behavior in both 
storage or computational complexity. Our primary focus here was on BPX-style addi- 
tive and multiplicative multilevel preconditioners, and on various stabilizations of the 
additive and multiplicative hierarchical basis method, and their use in the local mesh 
refinement setting. In the first two papers of this trilogy, it was shown that both BPX 
and wavelet stabilizations of HB have uniformly bounded conditions numbers on several 
classes of locally refined 2D and 3D meshes based on fairly standard (and easily im- 
plementable) red and red-green mesh refinement algorithms. In this third article of the 
trilogy, we described in detail the implementation of these types of algorithms, including 
detailed discussions of the datastructures and traversal algorithms we employ for obtain- 
ing optimal storage and computational complexity in our implementations. We showed 
how each of the algorithms can be implemented using standard datatypes available in 
languages such as C and FORTRAN, so that the resulting algorithms have optimal (lin- 
ear) storage requirements, and so that the resulting multilevel method or preconditioner 
can be applied with optimal (linear) computational costs. 

Our implementations were performed in both C and MATLAB using the Finite El- 
ement ToolKit (FEtk), an open source finite element software package. We presented 
a sequence of numerical experiments illustrating the effectiveness of a number of BPX 
and stabilized HB variants for several examples requiring local refinement. As expected, 
multigrid methods most effective in terms of iteration counts remaining constant as the 
DOF increase, but the suboptimal complexity per iteration in the local refinement setting 
makes the BPX methods the most attractive. In addition, both the additive and multiplica- 
tive WMHB-based methods and preconditioners demonstrated similar constant iteration 
requirements with increasing DOF, yet the cost per iteration remains optimal (linear) 
even in the local refinement setting. Consequently in highly adaptive regimes, the BPX 
methods prove to be the most effective, and the WMHB methods become the second best 
effective. 
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